Introduction to R
for Social Scientists

Workshop Day 3A | 2022-07-27
Jeffrey M. Girard | Pitt Methods

Model I

Data Verification

  • It’s always a good idea to start with verification
  • Check that your variables are the correct type
    • Configure your factors’ levels and labels
    • Establish ordinal factors’ ordering
    • Explicitly set your missing values to NA
  • Check variables’ extrema and distributions
    • Check for erroneous and outlying values
    • Check the shape of continuous distributions
    • Check the overlap of categorical levels
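The checklist above can be sketched in base R. This is a minimal illustration with a toy data frame; the variable names, levels, and the -99 placeholder are all invented for demonstration, not taken from the workshop data.

```r
# Toy data frame with a nominal variable, an ordinal variable, and a
# -99 missing-value placeholder (all values invented for illustration)
df <- data.frame(
  rank  = c("Asst", "Assoc", "Full", "Asst"),
  grade = c("low", "high", "medium", "-99")
)

# Configure a factor's levels and labels
df$rank <- factor(
  df$rank,
  levels = c("Asst", "Assoc", "Full"),
  labels = c("Assistant", "Associate", "Full")
)

# Explicitly set the missing-value placeholder to NA, then
# establish the ordinal factor's ordering
df$grade[df$grade == "-99"] <- NA
df$grade <- factor(df$grade, levels = c("low", "medium", "high"), ordered = TRUE)

summary(df)
```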

Data Verification Live Coding

# SETUP:

library(tidyverse)
library(datawizard)

salaries <- read_csv("salaries.csv")

# ==============================================================================

# LESSON: First, set all your categorical variables to factors

salaries

salaries <- 
  salaries |> 
  mutate(across(c(rank, discipline, sex), factor)) |> 
  print()

# ==============================================================================

# LESSON: Then check the summary statistics for problems

summary(salaries)

describe_distribution(salaries)

data_tabulate(salaries, exclude = is.numeric)

# ==============================================================================

# LESSON: Finally, check for empty level-combinations

salaries |> 
  group_by(rank, discipline, sex, .drop = FALSE) |> # keep empty combinations
  summarize(n = n(), .groups = "drop") |> 
  arrange(n)

Distribution Geoms

  • Variable distributions are critical in data analysis
    • What are the most and least common values?
    • What are the extrema (min and max values)?
    • Are there any outliers or impossible values?
    • How much spread is there in the variable?
    • What shape does the distribution take?
  • Visualizations are a quick way to assess these questions
    • They can also communicate the answers to others

Distribution Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

salaries <- read_csv("salaries.csv")

# ==============================================================================

# USECASE: Creating histograms

ggplot(salaries, aes(x = salary)) + 
  geom_histogram()

ggplot(salaries, aes(x = salary)) + 
  geom_histogram(bins = 20)

ggplot(salaries, aes(x = salary)) + 
  geom_histogram(binwidth = 2)

ggplot(salaries, aes(x = salary)) + 
  geom_histogram(binwidth = 2, color = "red", linewidth = 1)

ggplot(salaries, aes(x = salary)) + 
  geom_histogram(binwidth = 2, color = "red", linewidth = 1, fill = "white")

# ==============================================================================

# USECASE: Creating density plots

ggplot(salaries, aes(x = salary)) + geom_density()

ggplot(salaries, aes(x = salary)) + 
  geom_density(color = "red", linewidth = 1, fill = "white")

# ==============================================================================

# USECASE: Creating box plots

ggplot(salaries, aes(x = salary)) + geom_boxplot()

ggplot(salaries, aes(x = salary, y = rank)) + 
  geom_boxplot(varwidth = TRUE)

# ==============================================================================

# USECASE: Creating bar plots to count categorical variables

ggplot(salaries, aes(x = rank)) + geom_bar()

# ==============================================================================

# PITFALL: Don't try to create histograms for categorical variables

ggplot(salaries, aes(x = rank)) + geom_histogram() # error: histograms require a continuous x

Correlations

  • Correlations (r) quantify the strength of the linear relationship between two variables from -1 to +1
    • r → -1: as x increases, y decreases
    • r → 0: as x increases, y doesn’t change
    • r → +1: as x increases, y also increases
  • Correlations may be the focus of statistical inference or just useful descriptive summaries
  • We can easily estimate many different types of correlation coefficients in R
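Before turning to the live coding, the definition of Pearson’s r can be unpacked in base R. This is a sketch with simulated data; the variables x and y are invented for illustration.

```r
# Simulate two linearly related variables
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson's r: the covariance of x and y, scaled by both standard
# deviations so the result always falls between -1 and +1
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)  # the two calculations agree
```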

Correlations Live Coding

# SETUP: Load the used packages (if needed) and read in the example dataset

library(tidyverse)
library(correlation)

salaries <- read_csv("salaries.csv")

# ==============================================================================

# LESSON: The standard correlation test in R

cor.test(salaries$salary, salaries$yrs.since.phd)

# ==============================================================================

# TIP: Fancier correlations with the {correlation} package (from {easystats})

correlation(salaries)

# ==============================================================================

# LESSON: Creating a graphical model plot (requires installing some packages)

correlation(salaries) |> plot()

# ==============================================================================

# LESSON: Creating and plotting the correlation matrix

correlation(salaries) |> summary()

correlation(salaries) |> summary() |> plot()

correlation(salaries) |> summary(redundant = TRUE) |> plot()

# ==============================================================================

# LESSON: p-value adjustments for multiple comparisons

correlation(salaries, p_adjust = "none") # very liberal

correlation(salaries, p_adjust = "bonferroni") # very conservative

correlation(salaries, p_adjust = "holm") # recommended
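These adjustment names match those of base R’s stats::p.adjust(), which is a quick way to see how the methods behave. The p-value vector below is invented for illustration.

```r
# Four raw p-values (invented for illustration)
p <- c(0.001, 0.01, 0.02, 0.04)

p.adjust(p, method = "bonferroni")  # 0.004 0.040 0.080 0.160 (each p times 4)
p.adjust(p, method = "holm")        # 0.004 0.030 0.040 0.040 (step-down; never
                                    # more conservative than bonferroni)
```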

# ==============================================================================

# LESSON: Other correlation methods

correlation(salaries, method = "kendall") # rank correlation

correlation(salaries, method = "blomqvist") # median-based rank correlation, similar to kendall

correlation(salaries, method = "distance") # linear and nonlinear relationships

Comparing Two Groups

  • A fundamental task in science is comparing the centrality of two groups’ distributions
    • It is common to compare means or medians
  • Independent groups are separate
    • Comparisons are between subjects
    • Did students in New York or students in California spend more time on social media?
  • Dependent groups are paired
    • Comparisons are within subjects
    • Did the students in my class spend more time on social media during the winter or the summer?
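The two designs map directly onto base R’s t.test(). This is a sketch with simulated scores; the group names, sample sizes, and effect sizes are invented for illustration.

```r
set.seed(1)

# Independent groups: separate students in two states (between subjects)
ny <- rnorm(30, mean = 120, sd = 20)
ca <- rnorm(30, mean = 130, sd = 20)
res_ind <- t.test(ny, ca)  # Welch two-sample t-test
res_ind

# Dependent groups: the same students measured twice (within subjects)
winter <- rnorm(30, mean = 100, sd = 15)
summer <- winter + rnorm(30, mean = 10, sd = 5)
res_paired <- t.test(winter, summer, paired = TRUE)  # paired t-test
res_paired
```

The {statsExpressions} functions in the next section report essentially these same tests, with effect sizes and formatted output added.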

Independent Groups Live Coding

# SETUP: Load the used packages (if needed) and read in the example dataset

library(tidyverse)
library(statsExpressions)
library(ggstatsplot)

salaries <- read_csv("salaries.csv")

# ==============================================================================

# USECASE: Let's compare the salaries across the two disciplines

two_sample_test(
  data = salaries, 
  x = discipline, 
  y = salary, 
  type = "parametric"
)

# ==============================================================================

# LESSON: We can also estimate each group's means by swapping the function

centrality_description(
  data = salaries,
  x = discipline, 
  y = salary, 
  type = "parametric"
)

# ==============================================================================

# LESSON: Nonparametric does not assume a normal distribution of y

two_sample_test(
  data = salaries, 
  x = discipline, 
  y = salary, 
  type = "nonparametric"
)

centrality_description(
  data = salaries, 
  x = discipline, 
  y = salary, 
  type = "nonparametric"
)

# ==============================================================================

# TIP: Calculate the test results and generate a plot using {ggstatsplot}

ggbetweenstats(
  data = salaries, 
  x = discipline, 
  y = salary, 
  type = "parametric"
)

ggbetweenstats(
  data = salaries, 
  x = discipline, 
  y = salary, 
  type = "nonparametric"
)

Dependent Groups Live Coding

# SETUP: Load the used packages (if needed) and read in the example dataset

library(tidyverse)
library(statsExpressions)
library(ggstatsplot)

interpersonal <- read_csv("interpersonal.csv")
interpersonal

# ==============================================================================

# LESSON: Be sure to set paired to TRUE and provide a subject.id

two_sample_test(
  data = interpersonal,
  x = time,
  y = problems,
  paired = TRUE,
  subject.id = patient,
  type = "parametric"
)

centrality_description(
  data = interpersonal,
  x = time,
  y = problems,
  type = "parametric"
)

# ==============================================================================

# LESSON: There is a nonparametric version of this too

two_sample_test(
  data = interpersonal,
  x = time,
  y = problems,
  paired = TRUE,
  subject.id = patient,
  type = "nonparametric"
)

centrality_description(
  data = interpersonal,
  x = time,
  y = problems,
  type = "nonparametric"
)

# ==============================================================================

# TIP: No need to set paired = TRUE because ggwithinstats() assumes within-subjects data

ggwithinstats(
  data = interpersonal, 
  x = time, 
  y = problems,
  subject.id = patient,
  type = "parametric"
)

ggwithinstats(
  data = interpersonal, 
  x = time, 
  y = problems,
  subject.id = patient,
  type = "nonparametric"
)

Comparing Many Groups

  • We may also compare more than two groups
    • Independent vs. dependent still applies
  • An omnibus test asks whether groups are different
    • If significant, at least one group differs from the others
  • Posthoc tests ask how groups are different
    • Pairwise tests compare each pair of groups
    • They are often “gated” behind omnibus tests
    • With many groups, p-values are adjusted
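The omnibus-then-posthoc workflow can also be sketched in base R. The group labels, means, and sample sizes below are simulated and invented for illustration.

```r
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 30))
y <- rnorm(90, mean = c(10, 12, 15)[g])  # three groups with different means

# Omnibus test: are the group means different at all?
fit <- aov(y ~ g)
summary(fit)

# Posthoc pairwise tests, with Holm-adjusted p-values
pw <- pairwise.t.test(y, g, p.adjust.method = "holm")
pw
```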

Many Independent Live Coding

# SETUP: Load the used packages (if needed) and read in the example dataset

library(tidyverse)
library(statsExpressions)
library(ggstatsplot)

salaries <- read_csv("salaries.csv")

# ==============================================================================

# USECASE: Let's compare the salaries across the three ranks

oneway_anova(
  data = salaries,
  x = rank,
  y = salary,
  type = "parametric"
)

centrality_description(
  data = salaries,
  x = rank,
  y = salary,
  type = "parametric"
)

oneway_anova(
  data = salaries,
  x = rank,
  y = salary,
  type = "nonparametric"
)

# ==============================================================================

# USECASE: Let's compare the salaries between each pair of ranks

pairwise_comparisons(
  data = salaries,
  x = rank,
  y = salary,
  type = "parametric"
)

pairwise_comparisons(
  data = salaries,
  x = rank,
  y = salary,
  type = "nonparametric"
)

# ==============================================================================

# TIP: Do it all (overall and pairwise tests) with ggbetweenstats

ggbetweenstats(
  data = salaries, 
  x = rank, 
  y = salary,
  type = "parametric"
)

ggbetweenstats(
  data = salaries, 
  x = rank, 
  y = salary,
  type = "nonparametric"
)

Many Dependent Live Coding

# SETUP: Load the used packages (if needed) and read in the example dataset

library(tidyverse)
library(statsExpressions)
library(ggstatsplot)

elicitation <- read_csv("elicitation.csv")
elicitation

# ==============================================================================

# USECASE: Let's compare the amusement self-reports across the four tasks

oneway_anova(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "parametric"
)

centrality_description(
  data = elicitation,
  x = Task,
  y = Amused,
  type = "parametric"
)

oneway_anova(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "nonparametric"
)

# ==============================================================================

# USECASE: Let's compare the amusement self-reports between each pair of tasks

pairwise_comparisons(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "parametric"
)

pairwise_comparisons(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "nonparametric"
)

# ==============================================================================

# TIP: Do it all (overall and pairwise tests) with ggwithinstats

ggwithinstats(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "parametric"
)

ggwithinstats(
  data = elicitation,
  x = Task,
  y = Amused,
  paired = TRUE,
  subject.id = Subject,
  type = "nonparametric"
)

Practice V